Machine Learning Project

Company Bankruptcy Prediction

Names:

Dataset: https://www.kaggle.com/fedesoriano/company-bankruptcy-prediction

Data: May/2021

Libraries

Understanding the distribution of all variables:

At this point, we create a Decision Tree classifier, applying the default parameters to train and predict our model. The default parameters for CART algorithm can be found here: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier
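A minimal sketch of this step, using synthetic stand-in data in place of the bankruptcy dataset (the feature matrix and split sizes here are illustrative assumptions, not the project's actual values):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data; the project uses the Kaggle bankruptcy dataset.
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# All parameters left at their defaults (criterion="gini", max_depth=None,
# min_samples_split=2, ...), matching the CART defaults in the link above.
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(clf.get_params()["criterion"])  # gini
```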

SMOTE - Synthetic Minority Over-sampling Technique

What is SMOTE? SMOTE (Synthetic Minority Oversampling Technique) works by randomly picking a point from the minority class and computing the k-nearest neighbors for this point. The synthetic points are added between the chosen point and its neighbors. Source: https://www.analyticsvidhya.com/blog/2020/07/10-techniques-to-deal-with-class-imbalance-in-machine-learning/

We apply the SMOTE technique of handling imbalance to help us improve the performance of the model.

Evaluation of the model

The Receiver Operating Characteristic (ROC) curve is a standard technique for summarizing classifier performance over a range of tradeoffs between true positive and false positive error rates (Swets, 1988).

The Area Under the Curve (AUC) is an accepted traditional performance metric for a ROC curve (Duda, Hart, & Stork, 2001; Bradley, 1997; Lee, 2000). The ROC convex hull can also be used as a robust method of identifying potentially optimal classifiers (Provost & Fawcett, 2001). If a line passes through a point on the convex hull, then there is no other line with the same slope passing through another point with a larger true positive (TP) intercept. Thus, the classifier at that point is optimal under any distribution assumptions in tandem with that slope.
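The ROC curve and its AUC can be computed from the classifier's predicted probabilities; a minimal sketch on synthetic stand-in data (sizes and seeds are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data; the project uses the bankruptcy dataset.
X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

clf = DecisionTreeClassifier(max_depth=4, random_state=1).fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]          # score for the positive class

fpr, tpr, thresholds = roc_curve(y_test, probs)  # TP rate vs. FP rate trade-off
auc = roc_auc_score(y_test, probs)               # area under the ROC curve
print(round(auc, 3))
```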

The Mean Squared Error (MSE) is a measure of how close a fitted line is to the data points. For every data point, you take the vertical distance from the point to the corresponding y value on the curve fit (the error) and square the value. Then you add up those values for all data points and, in the case of a fit with two parameters such as a linear fit, divide by the number of points minus two. The squaring is done so negative values do not cancel positive values. The smaller the Mean Squared Error, the closer the fit is to the data. The MSE has the square of the units of whatever is plotted on the vertical axis.

Another quantity that we calculate is the Root Mean Squared Error (RMSE). It is just the square root of the mean square error. That is probably the most easily interpreted statistic, since it has the same units as the quantity plotted on the vertical axis.

Key point: The RMSE is thus the distance, on average, of a data point from the fitted line, measured along a vertical line.

The RMSE is directly interpretable in terms of measurement units, and so is a better measure of goodness of fit than a correlation coefficient. One can compare the RMSE to observed variation in measurements of a typical point. The two should be similar for a reasonable fit.
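The two error measures above can be computed directly with scikit-learn; a small sketch with made-up observed and fitted values (the numbers are purely illustrative):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical example: four observed values and their fitted values.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_fit = np.array([2.5, 0.0, 2.0, 8.0])

mse = mean_squared_error(y_true, y_fit)  # mean of the squared vertical errors
rmse = np.sqrt(mse)                      # same units as the plotted quantity

print(mse)   # 0.375
print(round(rmse, 4))
```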

From the confusion matrix, you can see that out of 2046 test instances, our algorithm misclassified only 114, giving an accuracy of about 94%.

From the confusion matrix, you can see that out of 3960 test instances, our algorithm misclassified only 183, giving an accuracy of about 95%.

In Decision Tree classifiers, we decide on a split point and calculate a metric, i.e. entropy or Gini impurity, at a given node for the left and right child nodes after the split. We show the performance of the compared trees in terms of accuracy at a defined depth for each tree, testing both the Gini and entropy criteria.

Visualise the Decision Trees
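One way to inspect a fitted tree, sketched here with `export_text` on synthetic stand-in data (`sklearn.tree.plot_tree` produces the graphical equivalent); the feature names are invented for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical stand-in data with four features.
X, y = make_classification(n_samples=300, n_features=4, random_state=3)
clf = DecisionTreeClassifier(max_depth=3, random_state=3).fit(X, y)

# Text rendering of the fitted tree: one indented line per split/leaf.
rules = export_text(clf, feature_names=[f"feat_{i}" for i in range(4)])
print(rules)
```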

Useful Information concerning the default and weighted trees

Problems with imbalanced data in Decision Trees
As the classes of our model are strongly imbalanced, the model tends to learn the most common class (class 0 - not bankrupt) and fails to extract information from the other class.

Proposed solutions for this premise

SMOTE: we preferred this approach to avoid problems with imbalanced data. As described above, SMOTE (Synthetic Minority Oversampling Technique) works by randomly picking a point from the minority class and computing the k-nearest neighbors for this point; the synthetic points are added between the chosen point and its neighbors. Source: https://www.analyticsvidhya.com/blog/2020/07/10-techniques-to-deal-with-class-imbalance-in-machine-learning/

However, if we use cross-validation, we fear that oversampling before splitting would place synthetic copies of the same minority examples in both the training and validation folds, which would cause overfitting and optimistic scores. To handle this, SMOTE should be applied only to the training data within each of the 10 folds, so that every validation fold contains original samples only.

Random Over Sampler: a naive method that balances the classes by randomly resampling, with replacement, examples from the minority class that has few instances.

Conclusion

Although the Decision Tree is effective on balanced classification problems, it does not do well on an imbalanced dataset. Since our dataset is imbalanced, one class dominates and the minority class is largely ignored, because the class proportions drive the splitting process. We therefore tried to overcome this imbalance by modifying the data used at the split points so that the importance of each class is taken into account: we rebalanced the classes to a 50/50 split and derived a balanced decision tree for this exercise.
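Class importance can also be injected without resampling via scikit-learn's `class_weight` parameter, which reweights the impurity at each split inversely to class frequency; a minimal sketch on synthetic stand-in data (not the project's actual configuration):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical imbalanced stand-in data.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05],
                           random_state=6)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=6)

# class_weight="balanced" weights samples inversely to class frequency,
# so each split's impurity treats the classes as if they were 50/50.
clf = DecisionTreeClassifier(class_weight="balanced", random_state=6)
clf.fit(X_train, y_train)

minority_recall = recall_score(y_test, clf.predict(X_test))
print(round(minority_recall, 3))
```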